What can linguistics contribute to event extraction?
Authors
Abstract
This paper examines how a linguistic analysis of a written document can contribute to identifying, tracking and populating the “eventualities” that are presented in the document, either directly or indirectly, and to representing degrees of belief concerning them. It is our view that the role of lexical analysis (as exemplified in the research carried out in the FrameNet project) is greater than usually assumed, so this paper is partly an attempt to clarify the boundary between, on the one hand, the information that can be derived on the basis of linguistic knowledge alone (composed of lexical meanings and the meanings of grammatical constructions) and, on the other hand, reasoning based on beliefs about the source of a document, world knowledge, and “common sense”. Since the linguistic processes described in this paper apply to eventualities in general (by which we mean acts, happenings, states of affairs, and relations, whether real, proposed, imagined, or denied), our presentation will emphasize the linguistic processes themselves. In particular, we show that the kind of information produced by the lexicon-building project FrameNet can have a special role in contributing to text understanding, starting from the basic facts of the combinatorial properties of frame-bearing words (verbs, nouns, adjectives and prepositions) and arriving at the means of recognizing the anaphoric properties of specific unexpressed event participants, for all parts of speech, thereby defining a new layer of anaphora resolution and text cohesion. Using as a starting point the challenge text presented in the call for this workshop (hereafter referred to as the Hijacking text), we show the points at which a thorough linguistic analysis can articulate with the kind of simulation formalism demonstrated in the X-schema diagram (Figure 2), which itself incorporates a great deal of world knowledge connected with the events introduced in the Hijacking text.
Valence and Text Understanding
We will begin our discussion of the ongoing FrameNet lexicon-building work with those parts that more or less match familiar linguistic analysis and are sufficient for dealing with the linguistic analysis of the Hijacking text, and then survey further kinds of linguistic information that (1) make it possible to recognize attributions of truth claims, or (2) identify entities and eventualities that are not explicitly mentioned in given sentences.
Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
The FrameNet project (http://framenet.icsi.berkeley.edu, Fontenelle 2003) is devoted to discovering and describing the lexical valences of lexical units in English, that is, their semantic and syntactic combinatorial properties, and how these properties can be used for identifying and populating the eventualities that are linguistically coded in a document. The most straightforward way in which eventualities can be found and filled in is through (1) the recognition of frame-bearing words that designate eventualities of particular types and (2) the identification of phrases in the syntactic context of such words that denote participants (“slot fillers”) in these eventualities.
Consider the sentence The bodyguard poisoned the emperor; here, the transitive verb poison designates an event type in which a person administers poison to a living being, and this sentence identifies the person referred to as the bodyguard as the agent of such an act, and the person referred to as the emperor as its victim. Given that the sentence has a past-tense form, we can say that the document claims this to be an actual occurrence, and the sentence invites the expectation (through the use of the definite article “the” and the simple past tense) that an earlier part of the text offers further details about the individuals and the time of the event.
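The two-step procedure just described (recognize a frame-bearing word, then map its syntactic dependents onto participant slots) can be sketched in a few lines of Python. This is only an illustrative sketch: the lexicon, the frame names, and the role labels below are hypothetical stand-ins and are not taken from the FrameNet database, and the input is assumed to be an already-parsed subject-verb-object clause.

```python
from dataclasses import dataclass, field

# Hypothetical mini-lexicon: frame names and role labels are illustrative
# assumptions, not entries copied from the FrameNet database.
LEXICON = {
    "poisoned": {"frame": "Poisoning", "subject": "Poisoner", "object": "Victim"},
    "hijacked": {"frame": "Hijacking", "subject": "Perpetrator", "object": "Vehicle"},
}


@dataclass
class Eventuality:
    frame: str                 # the event type evoked by the trigger word
    trigger: str               # the frame-bearing word itself
    roles: dict = field(default_factory=dict)  # role label -> slot filler


def populate(subject: str, verb: str, obj: str):
    """Return a filled frame for a simple subject-verb-object clause, or None."""
    entry = LEXICON.get(verb)
    if entry is None:
        return None  # the verb is not a known frame-bearing word
    roles = {entry["subject"]: subject, entry["object"]: obj}
    return Eventuality(frame=entry["frame"], trigger=verb, roles=roles)


event = populate("the bodyguard", "poisoned", "the emperor")
print(event)
```

A fuller treatment would read the valence patterns from the lexical resource itself and would take its dependents from a parser rather than from hand-supplied arguments.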
The adjective fond, as in Our daughters are fond of horses, designates a particular positive psychological state, and in this sentence the holder of the attitude is identified as our daughters, and the object of this state is taken to be horses in general. As a present-tense sentence it can be seen as claiming that this is a more or less permanent disposition of the individuals referred to as our daughters.
The noun advice, found in John’s advice to his daughter, designates a conversational event of advice-giving, and in this sentence John is presented as the communicator in an instance of this event type and his daughter is the addressee. Since this is a noun phrase, temporal coordinates are missing, but in a past-tense referring context like John’s advice to his daughter was wise, we can say that the text claims that such an event did occur. In each case further details about the events and the participants, including the content of the advice, are likely to be found in surrounding parts of the text.
Almost all of the complexity of the Hijacking text can be dealt with through the interpretation of its frame-bearing verbs and nouns; the question of event order requires further kinds of reasoning and evidence, some (but not all) of it based in the sentence’s linguistic form. Consider the Hijacking sentence:
(1) The United Nations says Somali gunmen who hijacked a U.N.-chartered vessel carrying food aid for tsunami victims have released the ship after holding it for more than two months.
In (1), the main frame-bearing lexical units that evoke event types are says, hijacked, chartered, carrying, aid, tsunami, victims, released, and holding. Named entities in the sentence include United Nations, Somali, and U.N.; names of other entities are gunmen, vessel, food, and ship; and time-marking expressions include the aspect-marking have, the conjunction after, and the phrase for more than two months.
A dependency parse of the sentence is given as Figure 1; frames are associated with frame-bearing words; frame elements are associated with the dependent phrases in the diagram. The events identified in the narrative, in the order of their occurrence, are the tsunami, the chartering of a ship, the transporting of food aid (interrupted), the hijacking, the illegal retention of the ship, the ship’s release, and the announcement about the release. The meaning of after shows that the releasing followed the holding; the fact that the reported event is have released rather than will release shows that the releasing event preceded the reporting event; since the vessel was carrying food aid at the time of the hijacking, the chartering and launching of the rescue vessel preceded the hijacking. Since in principle one can carry food aid for potential victims of a predicted tsunami, and it is not a linguistic fact that one does not charter a ship that is already on a mission, the ordering of these sub-events cannot be determined on linguistic grounds alone.
The category of event-introducing frame-evoking words is not limited to verbs. To be a victim is to be a participant in some unfortunate event, and this event is generally expressed as a modifier of the noun: an X victim, or a victim of X. The noun tsunami names an event which itself has no obligatory syntactic dependents, though it can have qualifiers indicating location, time, intensity, etc. In addition to expressions that directly point to the event that produced victims (the X in the above phrasings), it is possible to interpret the word victim alone as implying the existence of such an event, in a context in which information about the causing event is recoverable nearby. Thus, whenever we encounter a sentence like Were there any victims? it must be the case that mention of the mishap can be found in a recent part of the discourse.
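The event-ordering argument in this section can be made concrete with a small sketch. It records only the three precedence relations that the text derives on linguistic grounds (from after, from have released, and from carrying at the time of the hijacking), computes their transitive closure, and reports every other pair as undetermined. The event labels and function names are hypothetical, chosen only for this illustration.

```python
from itertools import product

# (earlier, later) pairs licensed by the linguistic form of sentence (1).
BEFORE = {
    ("holding", "release"),        # "released the ship after holding it"
    ("release", "announcement"),   # "have released", not "will release"
    ("chartering", "hijacking"),   # the vessel was carrying aid when hijacked
}


def transitive_closure(pairs):
    """Add every (a, d) such that (a, b) and (b, d) are already present."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def order(a, b, closure):
    """Return how event a is ordered relative to event b."""
    if (a, b) in closure:
        return "before"
    if (b, a) in closure:
        return "after"
    return "undetermined"  # needs world knowledge, not linguistic form


CLOSURE = transitive_closure(BEFORE)
print(order("holding", "announcement", CLOSURE))
print(order("tsunami", "chartering", CLOSURE))
```

Keeping the linguistically licensed relations separate from those supplied by world knowledge makes it explicit which parts of the final timeline rest on the sentence's form and which rest on inference.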
There are many linguistic ways in which information about eventualities and claims about their actuality can be presented in a sentence. These include, most straightforwardly, frame-evoking verbs and nouns, and the ways in which the phrases they are in grammatical construction with contribute to “populating” (“filling the slots of”) the frames. Verbs whose complements (including direct objects) are understood as participants in the frames they evoke are amply illustrated in the Hijacking text: the UN says something (the proposition about the ship’s release); the gunmen hijacked the ship; the gunmen released the ship; the ship carried food aid; the hijackers held the ship for more than two months.
Information about reliability and source-attributions of event reports can be provided in metadata and in text-internal mention of evidential sources. The Hijacking text itself is attributed (in the call for papers) to the Voice of America: knowing the mission of the VOA might be relevant to evaluating the reliability of the report, but that is not a linguistic matter. The reporting of the hijacking and releasing incident is attributed to the United Nations: again, evaluating the truth of the report by considering its source is not a linguistic matter.
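One way to picture the division of labor described here is a representation that records who a claim is attributed to without assigning it any reliability score. The sketch below is an assumption-laden illustration: the Claim class and its field names are hypothetical, and it deliberately stops at recording the attribution chain, since evaluating the sources is, as argued above, not a linguistic matter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    content: str                            # the proposition being reported
    source: Optional[str] = None            # who the claim is attributed to
    embedded_in: Optional["Claim"] = None   # the report that contains it


def attribution_chain(claim):
    """List the sources from the innermost claim outward."""
    chain = []
    while claim is not None:
        if claim.source:
            chain.append(claim.source)
        claim = claim.embedded_in
    return chain


# Metadata-level attribution: the text as a whole comes from the VOA.
document = Claim("the UN says gunmen have released the ship",
                 source="Voice of America")
# Text-internal attribution: the release is reported by the United Nations.
release = Claim("gunmen have released the ship",
                source="United Nations", embedded_in=document)

print(attribution_chain(release))
```

Any downstream judgment about how far to trust each source in the chain would be layered on top of this structure by a separate, non-linguistic component.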
Similar resources
Domain-Independent Detection, Extraction, and Labeling of Atomic Events
The notion of an “event” has been widely used in the computational linguistics literature as well as in information retrieval and various NLP applications, although with significant variance in what exactly an event is. We describe an empirical study aimed at developing an operational definition of an event at the atomic (sentence or predicate) level, and use our observations to create a system...
Exploring the Conceptions of Academic Reading Comprehension by Iranian Graduate Students of Applied Linguistics
Although the importance of reading in higher education as an index of success has been highlighted, the metacognitive knowledge or beliefs of graduate students have remained under-researched. This qualitative study reports on a study that, first, examines how graduate students of applied linguistics conceive of academic reading and academic readers in their graduate programs; second, wh...
Building Chinese Event Type Paradigm Based on Trigger Clustering
Traditional Event Extraction mainly focuses on event type identification and event participants extraction based on pre-specified event type annotations. However, different domains have different event type paradigms. When transferring to a new domain, we have to build a new event type paradigm. It is a costly task to discover and annotate event types manually. To address this problem, this pap...
Comparative Study of Nominalization in Applied Linguistics and Biology Books
This study explored nominalized expression types in an applied linguistics book and a biology book as 2 distinct disciplines. The books were carefully read, the nominalized expression types were identified, the frequencies of the nominalization types were counted, and eventually chi-square was administered. Results revealed no significant difference in using nominalization. Furthermore, the den...
Acquiring Topic Features to improve Event Extraction: in Pre-selected and Balanced Collections
Event extraction is a particularly challenging type of information extraction (IE) that may require inferences from the whole article. However, most current event extraction systems rely on local information at the phrase or sentence level, and do not consider the article as a whole, thus limiting extraction performance. Moreover, most annotated corpora are artificially enriched to include enou...
Automatically Generated Noun Lexicons for Event Extraction
In this paper, we propose a method for creating automatically weighted lexicons of event names. Almost all names of events are ambiguous in context (i.e., they can be interpreted in an eventive or noneventive reading). Therefore, weights representing the relative eventiveness of a noun can help for disambiguating event detection in texts. We applied our method on both French and English corpora...